Learning Hebrew Roots: Machine Learning with Linguistic Constraints

نویسندگان

  • Ezra Daya
  • Dan Roth
  • Shuly Wintner
چکیده

The morphology of Semitic languages is unique in the sense that the major word-formation mechanism is an inherently non-concatenative process of interdigitation, whereby two morphemes, a root and a pattern, are interwoven. Identifying the root of a given word in a Semitic language is an important task, in some cases a crucial part of morphological analysis. It is also a non-trivial task, which many humans find challenging. We present a machine learning approach to the problem of extracting roots of Hebrew words. Given the large number of potential roots (thousands), we address the problem as one of combining several classifiers, each predicting the value of one of the root’s consonants. We show that when these predictors are combined by enforcing some fairly simple linguistics constraints, high accuracy, which compares favorably with human performance on this task, can be achieved.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Identifying Semitic Roots: Machine Learning with Linguistic Constraints

Words in Semitic languages are formed by combining two morphemes: a root and a pattern. The root consists of consonants only, by default three, and the pattern is a combination of vowels and consonants, with non-consecutive “slots” into which the root consonants are inserted. Identifying the root of a given word is an important task, considered to be an essential part of the morphological analy...

متن کامل

Learning to Identify Semitic Roots

The standard account of word-formation processes in Semitic languages describes words as combinations of two morphemes: a root and a pattern. The root consists of consonants only, by default three (although longer roots are known), called radicals. The pattern is a combination of vowels and, possibly, consonants too, with ‘slots’ into which the root consonants can be inserted. Words are created...

متن کامل

Distinguishing between True and False Stories using various Linguistic Features

This paper analyzes what linguistic features differentiate true and false stories written in Hebrew. To do so, we have defined four feature sets containing 145 features: POStags, quantitative, repetition, and special expressions. The examined corpus contains stories that were composed by 48 native Hebrew speakers who were asked to tell both false and true stories. Classification experiments on ...

متن کامل

Tagging a Hebrew Corpus: the Case of Participles

We report on an effort to build a corpus of Modern Hebrew tagged with parts of speech and morphology. We designed a tagset specific to Hebrew while focusing on four aspects: the tagset should be consistent with common linguistic knowledge; there should be maximal agreement among taggers as to the tags assigned to maintain consistency; the tagset should be useful for machine taggers and learning...

متن کامل

Pii: S0010-0277(01)00167-6

Does the productive use of language stem from the manipulation of mental variables (e.g. “noun”, “any consonant”)? If linguistic constraints appeal to variables, rather than instances (e.g. “dog”, “m”), then they should generalize to any representable novel instance, including instances that fall beyond the phonological space of a language. We test this prediction by investigating a constraint ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004